STRIDE: a command-line HMM-based identifier and sub-classifier of Plasmodium falciparum RIFIN and STEVOR variant surface antigen families

Background: RIFINs and STEVORs are variant surface antigens expressed by P. falciparum that play roles in severe malaria pathogenesis and immune evasion. These two highly diverse multigene families feature multiple paralogs, making their classification challenging using traditional bioinformatic methods.
Results: STRIDE (STevor and RIfin iDEntifier) is an HMM-based, command-line program that automates the identification and classification of RIFIN and STEVOR protein sequences in the malaria parasite Plasmodium falciparum. STRIDE is more sensitive in detecting RIFINs and STEVORs than available PFAM and TIGRFAM tools and reports RIFIN subtypes and the number of sequences with a FHEYDER amino acid motif, which has been associated with severe malaria pathogenesis.
Conclusions: STRIDE will be beneficial to malaria research groups analyzing genome sequences and transcripts of clinical field isolates, providing insight into parasite biology and virulence.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04515-8.

activation, weakening host defenses against malaria infection [6]. Both protein families are also targets of natural immunity [7].
RIFINs and STEVORs pose challenges in genomic analyses because their immense genetic diversity and numerous paralogs complicate reference-based assembly and identification. Few bioinformatic approaches can distinguish between RIFINs and STEVORs and further classify RIFINs to the subtype level. Apart from laborious sequence alignment and phylogenetic analyses, BLAST is one of the few available tools [8]. However, BLAST requires a comprehensive reference index, lacks the sensitivity to detect highly divergent sequences, and cannot readily delineate between RIFIN subtypes. In contrast, profile Hidden Markov Models (HMMs) offer not only better accuracy and speed, but also sensitivity in detecting remote homologs [9]. Three HMM-based tools have been used to categorize RIFIN and STEVOR sequences: RSpred [4], TIGRFAM [10], and PFAM [11]; however, each is built using limited sets of reference RIFIN and/or STEVOR sequences. The more recent tools TIGRFAM and PFAM, as part of the InterPro database [11], do not subtype RIFINs or automatically assign annotations. While RSpred addressed these concerns, it was web-based, could only evaluate ten sequences per job, and its web interface is no longer responsive.
Here, we introduce an improved HMM-based, command-line program called STRIDE (STevor and RIfin iDEntifier). STRIDE has better sensitivity than available HMM tools to detect both RIFINs and STEVORs, and also features RIFIN subtyping, automated annotations, and adjustable thresholds for sensitivity and specificity. Importantly, STRIDE allows for the determination of the number of RIFIN-A sequences with a FHEYDER motif, providing insight into mechanisms to weaken host defenses. STRIDE will have particular value for malaria genomic epidemiologists, as next-generation sequencing of clinical field isolates increases in prevalence and the contributions of RIFIN and STEVOR multigene families to severe malaria pathogenesis and the acquisition of natural immunity to malaria become clearer.

Implementation
STRIDE consists of a merged HMM generated from three different refined multiple sequence alignments of full-length publicly available RIFIN and STEVOR protein sequences (Additional file 1: Figs. S1 and S2). A total of 3536 RIFIN and STEVOR sequences were downloaded from PlasmoDB (Release 45; August 28, 2019, keyword: "RIFIN/STEVOR"). Redundant sequences were clustered with CD-HIT v4.6 (option: -c 1.0). RIFIN-A, RIFIN-B, and STEVOR proteins were first identified via BLAST. For each set of protein sequences, a multiple sequence alignment was created, and a corresponding HMM was generated with hmmbuild (default parameters) as part of the HMMER3 v3.2.1 package. In an iterative process (Additional file 1: Fig. S1), we used each HMM profile to search for homologous sequences in other datasets. Sequences with the highest scores were incorporated into a new seed alignment, from which a new HMM profile was built. Training concluded for each HMM profile when no additional sequences could be extracted.
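The iterative training described above can be pictured as a loop that grows a seed set until no new sequences are recruited. The following Python sketch illustrates only that control flow and is not the authors' implementation: the function name `iterative_training`, the injected `score_against_profile` callable (standing in for rebuilding a profile with hmmbuild and searching with it), the fixed `min_score` cutoff, and the `max_rounds` guard are all assumptions introduced for illustration. The paper states that the highest-scoring sequences were added each round but does not give a numeric cutoff.

```python
from typing import Callable, Iterable, Set


def iterative_training(
    seed_ids: Set[str],
    candidate_ids: Iterable[str],
    score_against_profile: Callable[[Set[str], str], float],
    min_score: float,
    max_rounds: int = 50,
) -> Set[str]:
    """Grow a seed set until no remaining candidate scores >= min_score.

    score_against_profile(seed, candidate) is a hypothetical stand-in for
    "build a profile from the current seed, then score the candidate".
    """
    seed = set(seed_ids)
    remaining = set(candidate_ids) - seed
    for _ in range(max_rounds):
        # Recruit every remaining candidate that scores well against the
        # profile implied by the current seed set.
        hits = {c for c in remaining if score_against_profile(seed, c) >= min_score}
        if not hits:
            break  # training concludes: nothing more can be extracted
        seed |= hits
        remaining -= hits
    return seed
```

In a real run the scoring callable would wrap external alignment and HMMER calls; keeping it as a parameter lets the stopping rule be exercised with toy data.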
STRIDE uses a FASTA file as input and scores the query sequences against the HMM profile. A subprogram written in Perl v5.24 parses these scores and outputs the sequence classifications as a tab-delimited text file (Additional file 2). The main classifications are "RIFIN-A", "RIFIN-B", and "STEVOR." STRIDE outputs the number of RIFIN-As with a FHEYDER amino acid motif as an exact pattern match. Truncated or highly divergent sequences are designated as "likely" RIFIN or STEVOR, and those that are unable to meet RIFIN subtyping criteria due to insufficient discriminatory characteristics (e.g. missing the protein segment containing the defining 25 amino acid indel) are called simply "RIFIN."
To determine sensitivities and specificities, we created a "validation" dataset that spanned a range of variant surface antigen sequence sizes, including 3888 presumed RIFINs and STEVORs from sequenced clinical isolates and publicly available assemblies (Table 1, Additional file 1: Fig. S2) [12]. In addition, we downloaded annotated protein FASTA files from several Plasmodium reference genomes: P. falciparum 3D7 (5548 sequences), P. vivax (6667 sequences), P. berghei strain ANKA (5076 sequences), P. reichenowi (5644 sequences), and P. chabaudi (5217 sequences) to test our profiles for false positives and negatives.
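As an illustration of the exact-match motif step, the sketch below reads FASTA records and writes one tab-delimited row per sequence indicating whether the FHEYDER motif is present. It is written in Python for illustration, whereas STRIDE's own parser is written in Perl. The function names `read_fasta` and `motif_report` and the two-column output layout are assumptions, not STRIDE's actual output format. STRIDE reports the motif for RIFIN-A sequences only; for simplicity this sketch scans every record it is given.

```python
from typing import Iterable, Iterator, List, Tuple

MOTIF = "FHEYDER"  # motif named in the text; matched as an exact substring


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def motif_report(lines: Iterable[str], motif: str = MOTIF) -> List[str]:
    """Return tab-delimited rows: sequence ID and 'yes'/'no' for the motif."""
    rows = []
    for seq_id, seq in read_fasta(lines):
        has_motif = motif in seq.upper()
        rows.append(f"{seq_id}\t{'yes' if has_motif else 'no'}")
    return rows
```

Sequence lines are concatenated before matching, so a motif that is wrapped across two lines of the FASTA file is still found.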

Generation of HMM profiles
From the 3536 RIFIN and STEVOR sequences downloaded from PlasmoDB, 967 RIFIN-A, 495 RIFIN-B, and 229 STEVOR sequences comprised the final datasets at the conclusion of HMM training (Fig. 2, Additional file 1: Fig. S2). This included representation of sequences from all sampled genomes. The Malian (ML01) and Togo (TG01) strains were polyclonal and had higher overall numbers of representative sequences. Of the 228 total RIFINs and STEVORs annotated in the 3D7 reference genome, STRIDE incorporated 122 of these sequences.

Performance evaluation
The sensitivity and specificity of STRIDE are adjustable, although default parameters have been optimized to produce the most conservative designations (Fig. 3, Additional file 2). Datasets of 404 RIFIN-A, 476 RIFIN-B, and 40 STEVOR sequences that were randomly selected and excluded from the HMM training were used to test and define the limits of detection for each profile (Fig. 3, Additional file 1: Figs. S1 and S2). All RIFIN-A and -B sequences had low concordance to the STEVOR profile, failing to meet the STEVOR threshold score of 145. The 404 RIFIN-A sequences had whole sequence (represented in

Table 1 Comparison of STRIDE to PFAM and TIGRFAM, using the same parameter values

Based on these findings, we developed an algorithm to specify the type and subtype of a queried sequence based on whole sequence and domain scores (Additional file 2). The first limit of detection determines which of the three profiles (RIFIN-A, RIFIN-B, or STEVOR) registered the greatest whole sequence score. For a queried sequence to be considered a RIFIN, the whole sequence score must surpass a threshold of 200 against either the RIFIN-A or RIFIN-B profile. Queries with whole sequence scores between 100 and 200 to a RIFIN profile are considered "likely RIFINs" and scores ≤ 100 are considered "unlikely RIFINs". RIFIN subtyping requires a domain score ≥ 250 to a respective RIFIN profile; otherwise a query receives only a "RIFIN" annotation. Similarly, for the STEVOR HMM profile, scores between 100 and 145 are considered "likely STEVORs," and scores ≤ 100 are "unlikely STEVORs." STRIDE does not report queries that are vastly different from all of the profiles.
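The decision rules above can be sketched as a small function. This is a minimal illustration under stated assumptions, not STRIDE's Perl code. The thresholds (200, 100, 250, 145) come from the text. The function name `classify`, the dictionary inputs, and the handling of scores that fall exactly on a boundary (for example, whether a STEVOR score of exactly 145 passes) are assumptions. The rule that very dissimilar queries are not reported at all is not modelled, because the text gives no cutoff for it.

```python
from typing import Dict

RIFIN_WHOLE_MIN = 200.0    # whole sequence score must surpass this for a RIFIN call
LIKELY_MIN = 100.0         # scores at or below this are "unlikely"
RIFIN_DOMAIN_MIN = 250.0   # domain score needed to assign a RIFIN subtype
STEVOR_WHOLE_MIN = 145.0   # STEVOR threshold score


def classify(whole: Dict[str, float], domain: Dict[str, float]) -> str:
    """Classify one query from its scores against the three profiles.

    whole  maps "RIFIN-A", "RIFIN-B", "STEVOR" to whole sequence scores.
    domain maps profile names to domain scores (missing entries count as 0).
    """
    # Step 1: pick the profile with the greatest whole sequence score.
    best = max(whole, key=whole.get)
    score = whole[best]

    if best in ("RIFIN-A", "RIFIN-B"):
        if score > RIFIN_WHOLE_MIN:
            # Step 2: subtype only when the domain score is high enough.
            if domain.get(best, 0.0) >= RIFIN_DOMAIN_MIN:
                return best
            return "RIFIN"
        if score > LIKELY_MIN:
            return "likely RIFIN"
        return "unlikely RIFIN"

    # Best profile is STEVOR (boundary inclusion at 145 is an assumption).
    if score >= STEVOR_WHOLE_MIN:
        return "STEVOR"
    if score > LIKELY_MIN:
        return "likely STEVOR"
    return "unlikely STEVOR"
```

Keeping the thresholds as named constants mirrors the adjustable sensitivity and specificity described for the tool.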

Discussion
To compare sensitivity and specificity between tools, we adjusted the parameters of PFAM and TIGRFAM to match those of STRIDE. STRIDE detected STEVORs in the curated 3D7 reference genome with similar sensitivity to PFAM and TIGRFAM. Its sensitivity for detecting RIFINs was higher, but the difference was not statistically significant (p = 0.30; χ² = 2.41, DF = 2; Table 2). Specificity to 3D7 sequences was equivalent across all tools. Unlike PFAM and TIGRFAM, STRIDE was not trained using the entirety of RIFINs and STEVORs from the 3D7 repertoire (Fig. 2, Additional file 1: Fig. S4).
The "validation" dataset spanned a range of variant surface antigen sequence sizes and included 3888 presumed RIFINs and STEVORs from sequenced clinical isolates and publicly available assemblies (Table 1). STRIDE detected a total of 3540 RIFIN and STEVOR sequences (91.0%), more than PFAM (2707, 69.6%; p < 0.00001, χ² = 31.30, DF = 1) or TIGRFAM (3394, 87.3%; p = 0.31716, χ² = 1.00, DF = 1). We also used other Plasmodium reference genomes to further test for specificity. STRIDE appropriately detected RIFINs and STEVORs in gorilla- and chimpanzee-infecting parasites (e.g. P. reichenowi) but did not register any hits to the genomes of P. vivax, P. berghei, or P. chabaudi, three species that lack RIFIN and STEVOR orthologs (Table 1).
Using STRIDE, we reevaluated a subset of 320 sequences from PlasmoDB that had received a broad, overlapping "RIFIN/STEVOR family, putative" designation (Additional file 3). These sequences originated from long read-based assemblies of several parasite strains [13]. Among the 312 sequences that met or exceeded identification thresholds, 176 were determined to be RIFIN-As, including 52 with FHEYDER motifs; 80 were RIFIN-Bs; and 56 were STEVORs. Eight sequences did not meet the designated limits of detection for exact classifications. These were mostly truncated copies and thus classified by STRIDE as "RIFIN" or "likely RIFIN."
We also applied STRIDE to predict the number and classification of RIFINs and STEVORs from 15 unannotated long read-based de novo assemblies of clinical field isolates (Additional file 3) [12]. Initial classification using BLASTp led to mixed results and overlapping annotations. The number of STRIDE-predicted RIFINs and STEVORs from the NF54 de novo assembly mirrored that of 3D7, which was expected given that 3D7 is a clone of the NF54 isolate [14]. STRIDE also consistently identified comparable numbers of RIFINs, STEVORs, and FHEYDER motifs across most clinical samples from diverse geographies.
Several "likely RIFIN" sequences in each assembly are encoded by short, truncated contigs and could not be precisely classified. Proportionally greater numbers of sequences were found in the Myanmar samples, which are likely polyclonal (Additional file 3).
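The detection rates quoted for the validation dataset and the tallies from the PlasmoDB reclassification can be re-derived from the counts given in the text. The short Python sketch below does only that arithmetic. The names `detection_rates`, `DETECTED`, and `RECLASSIFIED` are introduced for illustration, and the chi-square statistics are not recomputed because the text does not specify the contingency tables behind them.

```python
from typing import Dict

VALIDATION_TOTAL = 3888  # presumed RIFINs and STEVORs in the validation dataset
DETECTED = {"STRIDE": 3540, "PFAM": 2707, "TIGRFAM": 3394}

# Reclassification of the 320 "RIFIN/STEVOR family, putative" sequences.
RECLASSIFIED = {"RIFIN-A": 176, "RIFIN-B": 80, "STEVOR": 56}
BELOW_EXACT_LIMITS = 8


def detection_rates(detected: Dict[str, int], total: int) -> Dict[str, float]:
    """Return the percentage of sequences detected by each tool, to 1 decimal."""
    return {tool: round(100.0 * n / total, 1) for tool, n in detected.items()}


rates = detection_rates(DETECTED, VALIDATION_TOTAL)
classified_total = sum(RECLASSIFIED.values())          # sequences meeting thresholds
subset_total = classified_total + BELOW_EXACT_LIMITS   # full reevaluated subset
```

Here `rates` reproduces the 91.0%, 69.6%, and 87.3% figures, `classified_total` equals the 312 sequences that met thresholds, and `subset_total` equals the 320 sequences reevaluated.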

Conclusions
We present STRIDE, an HMM-based, command-line program that automates RIFIN and STEVOR prediction, differentiates RIFIN-As from RIFIN-Bs, and identifies the number of sequences with the known pathogenic FHEYDER motif. STRIDE eliminates the need to perform multiple sequence alignments and phylogenetic analyses

Table 2 Depicting the sensitivity and specificity analyses of STRIDE compared to PFAM# and TIGRFAM# using 3D7^
# We made comparisons across tools using the same parameters as STRIDE
^ The curated 3D7 reference genome served as a gold standard. There are a total of 5548 sequences in the P. falciparum 3D7 reference, where 182 sequences are annotated as RIFINs and 43 sequences are annotated as STEVORs (includes 27 RIFIN and 10 STEVOR pseudogenes). Unlike PFAM and TIGRFAM, STRIDE was not trained with the entire 3D7 RIFIN and STEVOR repertoire (Fig. 2). The bolded text illustrates the sensitivity of each program; all three tools had 100% specificity